NSF PAR Search | NSF Public Access Repository


Adaptive Draft-Verification for Efficient Large Language Model Decoding

https://doi.org/10.1609/aaai.v39i23.34647

Liu, Xukun; Lei, Bowen; Zhang, Ruqi; Xu, Dongkuan DK (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either require fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called Adaptix, which accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM. The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that Adaptix significantly accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.
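As a rough illustration of the draft-verification idea described in this abstract, the following Python sketch builds drafts from a trigram count table that is updated during decoding and checks them against a target model. It is not the Adaptix implementation: the names `TrigramDrafter`, `verify_greedy`, and `decode` are hypothetical, the drafter greedily picks the most frequent continuation rather than balancing exploration and exploitation, and verification is simulated with one `target_next` call per position, whereas a real system would score the whole draft in a single forward pass.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # stand-in for a greedy LLM step


class TrigramDrafter:
    """Hypothetical drafter: counts of (a, b) -> c, updated as decoding proceeds."""

    def __init__(self) -> None:
        self.counts = defaultdict(Counter)

    def update(self, tokens: Sequence[Token]) -> None:
        # Record every consecutive trigram in the given token window.
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def draft(self, context: Sequence[Token], length: int) -> List[Token]:
        # Greedy draft: follow the most frequent continuation (an assumption;
        # the paper describes an exploration/exploitation trade-off instead).
        if len(context) < 2:
            return []
        a, b = context[-2], context[-1]
        out: List[Token] = []
        for _ in range(length):
            nxt = self.counts.get((a, b))
            if not nxt:
                break
            c = nxt.most_common(1)[0][0]
            out.append(c)
            a, b = b, c
        return out


def verify_greedy(target_next: NextTokenFn, context: List[Token],
                  draft: List[Token]) -> List[Token]:
    """Accept draft tokens while they match the target; on mismatch, emit the
    target's own token and stop."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        accepted.append(expected)
        if expected != tok:
            break
    return accepted


def decode(target_next: NextTokenFn, prompt: List[Token],
           max_new: int, draft_len: int = 4) -> List[Token]:
    drafter = TrigramDrafter()
    drafter.update(prompt)
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = drafter.draft(tokens, draft_len)
        new = (verify_greedy(target_next, tokens, draft)
               if draft else [target_next(tokens)])
        tokens.extend(new)
        # Adapt the table using only trigrams that end in a newly added token.
        drafter.update(tokens[-(len(new) + 2):])
    return tokens[:len(prompt) + max_new]
```

Because every emitted token is either confirmed or supplied by `target_next`, the output of this sketch matches plain greedy decoding of the target; any speed-up would come only from how many draft tokens are accepted per verification step.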
Full Text Available
Analysis of Climate Campaigns on Social Media using Bayesian Model Averaging

https://doi.org/10.1145/3600211.3604665

Islam, Tunazzina; Zhang, Ruqi; Goldwasser, Dan (August 2023, ACM)
Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

Papamarkou, Theodore; Skoularidou, Maria; Palla, Konstantina; Aitchison, Laurence; Arbel, Julyan; Dunson, David; Filippone, Maurizio; Fortuin, Vincent; Hennig, Philipp; Hernandez-Lobato, Jose Miguel; et al. (July 2024, Proceedings of Machine Learning Research)

Full Text Available
Low-Precision Stochastic Gradient Langevin Dynamics

Zhang, Ruqi; Wilson, Andrew Gordon; De Sa, Christopher (July 2022, Proceedings of Machine Learning Research)
Kamalika Chaudhuri, Stefanie Jegelka (Ed.)
While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance, due to its intrinsic ability to handle system noise. We prove that the convergence of low-precision SGLD with full-precision gradient accumulators is less affected by the quantization error than its SGD counterpart in the strongly convex setting. To further enable low-precision gradient accumulators, we develop a new quantization function for SGLD that preserves the variance in each update step. We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
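To make the setting in this abstract concrete, here is a minimal Python sketch of a single scalar SGLD update whose result is stored on a low-precision grid. It uses standard unbiased stochastic rounding, not the variance-preserving quantization function the paper develops; the names `stochastic_round` and `sgld_step_low_precision`, and the grid spacing `delta`, are assumptions made for illustration.

```python
import math
import random


def stochastic_round(x: float, delta: float, rng: random.Random) -> float:
    """Round x to a multiple of delta, rounding up with probability equal to
    the fractional distance to the lower grid point, so E[result] = x."""
    lower = math.floor(x / delta) * delta
    p_up = (x - lower) / delta
    return lower + delta if rng.random() < p_up else lower


def sgld_step_low_precision(theta: float, grad: float, step_size: float,
                            delta: float, rng: random.Random) -> float:
    """One SGLD update, theta - step_size * grad + sqrt(2 * step_size) * noise,
    followed by quantization of the new parameter to the grid."""
    noise = rng.gauss(0.0, 1.0)
    proposal = theta - step_size * grad + math.sqrt(2.0 * step_size) * noise
    return stochastic_round(proposal, delta, rng)
```

Stochastic rounding keeps the update unbiased but adds its own variance on top of the injected Gaussian noise, which is the effect a variance-preserving quantizer would need to correct.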
Full Text Available
